Introduction

This case study is part of my Google Data Analytics Capstone Project, where I’ll guide you through the six phases of Data Analytics: Ask, Prepare, Process, Analyze, Share and Act. Throughout these phases, I’ll demonstrate my ability to identify business tasks and key metrics, clean and process data for analysis, apply appropriate statistical techniques for data analysis and create interactive visualization using ggplotly. The final steps, including cleaning, analysis and visualization of data, will be executed using R Studio.

About the Company

In 2016, Cyclistic launched a bike-share program in Chicago, which has since expanded to include 5,824 bicycles across 692 stations. Cyclistic offers flexible pricing plans, attracting a wide range of customers. However, finance analysts have found that annual members are more profitable than casual riders. To boost future growth, the marketing team aims to convert casual riders into annual members. To achieve this, they need to understand the differences in bike usage between the two groups.

ASK

Business Task

Identify marketing strategies to convert casual riders into annual members. To do this, it is important to analyze how annual members and casual riders used Cyclistic bikes differently using the previous 12 months Cyclistic’s historical data.

Metrics that could be used to compare the ride patterns between casual and member riders:

  • Usage by type of bike (Docked vs Classic vs Electric bikes)
  • Duration of the trip in minutes
  • Peak time or day of the week
  • Most popular start and end stations

The key stakeholders for this project include the Director of Marketing, the Cyclistic executive team, and the marketing analytics department.

PREPARE

Data Source

For this analysis, data from the past 12 months (September 2023 - August 2024) was utilized. This data was stored using third party cloud services. Each monthly dataset contains essential details, including membership status (casual or member), type of bike used (classic, docked & electric), trip start and end times, and information about the start and end stations, along with their latitude and longitude coordinates.

We will use the following R packages:

  • tidyverse
  • lubridate
  • dplyr
  • rstatix
  • ggplot2
  • gtsummary
  • ggplotly
  • sf

PROCESS

To begin processing and cleaning the data, we will first import and merge the data sets.

Step 1: Import and merge datasets

# Define the path to the directory containing the CSV files
path <- "C:/Users/Smrut/OneDrive/Documents/Cyclistic Case Study/Data"
#path <- "C:/Data Analytics Portfolio/Cyclistic Case Study/Subset_data"

# Get a list of all CSV files in the directory
files <- list.files(path = path, pattern = "*.csv", full.names = TRUE)

# Use lapply to read all files into a list of data frames
data_list <- lapply(files, read.csv)

# Optionally, combine all data frames in the list into one data frame
combined_data <- do.call(rbind, data_list)

# Dimensions of the data frame
dim(combined_data)  
## [1] 5699639      13

After combining the data sets for the last 12 months, we have a data set with 5699639 rows and 13 variables.

# See the first 5 rows of the dataset
head(combined_data, n=5)
##            ride_id rideable_type          started_at            ended_at
## 1 011C1903BF4E2E28  classic_bike 2023-09-23 00:27:50 2023-09-23 00:33:27
## 2 87DB80E048A1BF9F  classic_bike 2023-09-02 09:26:43 2023-09-02 09:38:19
## 3 7C2EB7AF669066E3 electric_bike 2023-09-25 18:30:11 2023-09-25 18:41:39
## 4 57D197B010269CE3  classic_bike 2023-09-13 15:30:49 2023-09-13 15:39:18
## 5 8A2CEA7C8C8074D8  classic_bike 2023-09-18 15:58:58 2023-09-18 16:05:04
##              start_station_name start_station_id               end_station_name
## 1   Halsted St & Wrightwood Ave     TA1309000061 Sheffield Ave & Wellington Ave
## 2        Clark St & Drummond Pl     TA1307000142     Racine Ave & Fullerton Ave
## 3 Financial Pl & Ida B Wells Dr           SL-010           Racine Ave & 15th St
## 4        Clark St & Drummond Pl     TA1307000142       Racine Ave & Belmont Ave
## 5   Halsted St & Wrightwood Ave     TA1309000061     Racine Ave & Fullerton Ave
##   end_station_id start_lat start_lng  end_lat   end_lng member_casual
## 1   TA1307000052  41.92914 -87.64908 41.93625 -87.65266        member
## 2   TA1306000026  41.93125 -87.64434 41.92557 -87.65842        member
## 3          13304  41.87506 -87.63314 41.86127 -87.65663        member
## 4   TA1308000019  41.93125 -87.64434 41.93974 -87.65887        member
## 5   TA1306000026  41.92914 -87.64908 41.92557 -87.65842        member
# Removing objects to free up disk space
remove(files) 
remove(data_list)

Step 2: Process and clean data using data wrangling techniques. We need to retain records with non-missing membership status and valid start and end stations, ensuring all variables are converted to their appropriate data types.

# install.packages("tidyverse")
# install.packages("lubridate")

# Load the packages
library(tidyverse)
library(lubridate)

trips_data <- combined_data %>% 
  filter(member_casual != '' & start_station_id != '' & end_station_id != '') %>% #excluding missing records with subscription status, start and end station details
  rename(bike_type = rideable_type) %>%  #rename column as bike_type
  mutate(bike_type = factor(bike_type, levels = c( "classic_bike", "electric_bike"),  labels = c("Classic bike", "Electric bike")), #assigning value labels to bike_type
         member_casual = factor(member_casual, levels = c("member", "casual"),  labels = c("Member", "Casual")), #assigning value labels to member_casual
         started_at = ymd_hms(started_at), # converting char to date time
         ended_at = ymd_hms(ended_at)) %>%  # converting char to date time 
        filter(start_lat < 42.02 & end_lat < 42.02) #excluding latitude with top coordinates greater than 42.0 since these communities are outsider Chicago boundaries

# Remove the combined_data to free up disk space
rm(combined_data)

Step 3: Identify and remove duplicate rows based on the ride_id and start time to ensure data accuracy and consistency.

## Check for duplicates based on ride id (Credit Card numbers which could be repeatedly used by a rider hence may not be unique per ride), start_station and start time
trips_data_1 <- trips_data[!duplicated(trips_data$ride_id , trips_data$started_at ), ] 

# Remove the trips_data to free up disk space
rm(trips_data)

Step 4: Calculate new variables for analysis:

  • Trip duration in mins (ended_at - started_at)
  • Time, date, weekday, month and year from the start date
#install.packages("hms") to extract the time from start date time variable
library(hms)
trips_data_2 <- trips_data_1 %>% 
  mutate(duration_trip = as.numeric(difftime(ended_at, started_at,units = c("mins"))), #Trip duration as diff between (ended_at - started_at) 
         time_trip = as_hms(started_at), #extract time from start date
         day_trip = day(started_at), #extract date from start date
         month_trip = month(started_at, label = TRUE, abbr = FALSE ), #extract month from start date
         year_trip = year(started_at), #extract year from start date
         week_day_trip = wday(started_at,label = TRUE, abbr = FALSE )) #extract weekday from start date

# Remove the trips_data_1 to free up disk space
rm(trips_data_1)

Next, we try to graphically identify outliers in the trip duration variable by plotting a histogram using a bin width of 1.

library(ggplot2)
library(scales)

# Identifying outliers in duration using a histogram
ggplot(trips_data_2, aes(x=duration_trip)) +
   ggtitle("Histogram plot for trip duration with outliers") + 
  xlab ("Trip duration in mins") +
  geom_histogram(binwidth=1, color="black", fill="white") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Step 5: Removing extreme outliers and arriving at the final dataset for analysis

In this step we identify and remove the extreme outliers. Essentially there are two categories of outlier: (1) outliers and (2) extreme points.

Values above Q3 + 1.5xIQR or below Q1 - 1.5xIQR are considered as outliers. Values above Q3 + 3xIQR or below Q1 - 3xIQR are considered as extreme points (or extreme outliers).

Q1 and Q3 are the first and third quartile, respectively. IQR is the interquartile range (IQR = Q3 - Q1).

Using the identify_outliers() function in the rstatix package which flags the trip duration which are greater than 3xIQR. For this analysis, I decided to remove these records.

There were a few records with trip duration of less than a minute or negative duration, I decided to remove these as well assuming the rides were not really completed or the date times were not correctly captured respectively.

#install.packages("rstatix")
library(rstatix)

outliers<- trips_data_2 %>%
  identify_outliers(duration_trip)  #Identifying extreme outliers

outliers <- outliers[c("ride_id","is.extreme")] #Keeping column with ride_ids and extreme outliers

trips_data_3 <- merge(x = trips_data_2, y = outliers, by = "ride_id", all.x = TRUE) #Merging using left join 

#Final dataset for analysis
trips_data_4 <- trips_data_3 %>% 
  mutate(is.extreme = ifelse(is.na(is.extreme), "FALSE", is.extreme)) %>% #replacing the missing values with FALSE
  filter(!duration_trip <= 1 & # Excluding rides with a duration of less than 1 minutes or duration in negative, as it is assumed these rides were not completed or the date times were not correctly entered
           is.extreme %in% c("FALSE")) # Excluding the extreme outliers assuming data processing errors

# Histogram without outliers
ggplot(trips_data_4, aes(x=duration_trip)) +
   ggtitle("Histogram plot for trip duration without extreme outliers") + 
  xlab ("Trip duration in mins") +
  geom_histogram(binwidth=1, color="black", fill="white") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

After removing outliers, we can see that the trip duration is positively skewed with majority of the rides were taken for less than 20 mins.

# Remove datasets to free up disk space
rm(outliers)
rm(trips_data_2)
rm(trips_data_3)

Analyze

Step 5: Analyse the data

The following analysis was conducted to observe the ride patterns between member vs casual riders:

  • The summary of all rides
  • Compare the distribution bike type
  • Compare the duration of the trips
  • Compare the average ride duration by each day of the week
  • Compare the number of rides by each day of the week and time of the day
  • Most frequent start and end stations

Table 1: Summary of all ride characteristics

#install.packages("gtsummary")
library(gtsummary)

trips_data_4 %>%  
  select(bike_type, member_casual, duration_trip, day_trip, month_trip, week_day_trip) %>% #selecting vars for summary
  tbl_summary(
    label = list(bike_type = "Type of bike", member_casual = "Subscription status", 
                 duration_trip = "Trip duration (mins)", 'day_trip' = "Day of the trip", 
                 'month_trip' = "Month of the trip", 'week_day_trip' = "Week of the trip",
                 'year_trip' = "Year of the trip"), #assigning variable labels to display
    type = all_continuous() ~ "continuous2", #summary stat to continue on another row
    statistic = list(all_continuous() ~ c("{mean} ({sd})", "{median} ({p25} - {p75})","{min} - {max}")), #stats
    digits = (list(all_categorical() ~ c(0, 1))) #display percentage with 1 decimal
  )
Characteristic N = 3,966,7871
Type of bike
    Classic bike 2,640,104 (66.6%)
    Electric bike 1,326,683 (33.4%)
Subscription status
    Member 2,637,277 (66.5%)
    Casual 1,329,510 (33.5%)
Trip duration (mins)
    Mean (SD) 13 (10)
    Median (Q1 - Q3) 10 (6 - 17)
    Min - Max 1 - 55
Day of the trip
    Mean (SD) 16 (9)
    Median (Q1 - Q3) 15 (8 - 23)
    Min - Max 1 - 31
Month of the trip
    January 108,627 (2.7%)
    February 177,318 (4.5%)
    March 219,617 (5.5%)
    April 279,705 (7.1%)
    May 411,192 (10.4%)
    June 466,165 (11.8%)
    July 508,860 (12.8%)
    August 512,251 (12.9%)
    September 475,586 (12.0%)
    October 383,021 (9.7%)
    November 263,419 (6.6%)
    December 161,026 (4.1%)
Week of the trip
    Sunday 512,155 (12.9%)
    Monday 527,302 (13.3%)
    Tuesday 557,795 (14.1%)
    Wednesday 602,352 (15.2%)
    Thursday 579,634 (14.6%)
    Friday 572,646 (14.4%)
    Saturday 614,903 (15.5%)
1 n (%)

In the final dataset we had 3,966,787 ~ 4 million rides taken over the last 12 months with no missing data.

Some conclusion based on Table 1 for the all the rides taken:

  • Majority of the riders preferred to use Classic bike (66.6%) as compared to electric bike (33.4%)
  • Annual members (66.5%) were the dominant group of ride takers as compared to the casual riders (33.5%)
  • Median trip duration was 10 mins
  • August was the month with maximum rides
#Pie Chart to show the distribution of member and casual riders

tab1 <- table(trips_data_4$member_casual) #frequency table
prop1 <- as.data.frame(prop.table(tab1)) #proportion with denominator as member_casual var
prop1$percent <- as.numeric(format(round(prop1$Freq*100,1),nsmall=1)) # percentage calculation
colnames(prop1) <- c("member_casual","prop","percent") #renaming col names

# Pie chart
p1 <- ggplot(data=prop1, aes(x="", y=percent, fill=member_casual)) + 
geom_col(color = "black") +
  coord_polar(theta = "y")+ 
geom_text(aes(label = percent), size = 3, position = position_stack(vjust = 0.5), vjust = -1, color = "black") 

p1 +  theme_void() + theme(legend.key.size = unit(1.5, 'cm'), 
                           legend.text = element_text(size = 10)) + 
  scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) 

We calculate the sum of all trip duration for the all the rides and by subscriber status to depict the total minutes Cyclistic’s bike were used.

# Calculate the total trip duration

trips_data_4 %>%  
  select(member_casual, duration_trip) %>% #selecting vars for summary
  summarise(sum_duration = sum(duration_trip)
  )
##   sum_duration
## 1     51900904
# Calculate the total trip duration by subscription status

trips_data_4 %>%  
  select(member_casual, duration_trip) %>% #selecting vars for summary
  group_by(member_casual) %>% 
  summarise(sum_duration = sum(duration_trip)
  )
## # A tibble: 2 × 2
##   member_casual sum_duration
##   <fct>                <dbl>
## 1 Member           30609442.
## 2 Casual           21291462.

In last 12 months, total ride duration was 5,19,00,904 mins. Casual rider completed rides for 21291462 mins while members completed rides for a total of 30609442 mins.

Table 2: Comparison of Trip Duration Between Members and Casual Riders

trips_data_4 %>%  
  select(duration_trip, member_casual) %>%  #selecting vars for summary
  tbl_summary(
    by = member_casual,  #grouping variable
    label = list(duration_trip = "Trip duration (mins)", member_casual = "Subscription status"),#assigning variable labels 
    type = all_continuous() ~ "continuous2", #each summary stat to continue on another row
    statistic = list(all_continuous() ~ c("{mean} ({sd})", "{median} ({p25} - {p75})","{min} - {max}"))
  )  %>% 
  add_p() # add p value for wilcoxon signed rank test
Characteristic Member
N = 2,637,277
Casual
N = 1,329,510
p-value1
Trip duration (mins)

<0.001
    Mean (SD) 12 (9) 16 (12)
    Median (Q1 - Q3) 9 (5 - 15) 12 (7 - 21)
    Min - Max 1 - 55 1 - 55
1 Wilcoxon rank sum test

Histogram to plot the distribution of trip duration between member vs casual riders

p<- ggplot(trips_data_4, aes(x=duration_trip,fill=member_casual, color = member_casual)) +
  geom_histogram(alpha = 0.5, position="identity", binwidth=1) +
   theme_classic() +
  labs(title = "Histogram of Trip Duration (mins)") +
  xlab ("Trip duration (mins)") +
  ylab ("Count")

p +  scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  scale_x_continuous(breaks = seq(0, 60, by = 10)) 

Since the distribution for both the groups seems to be positively skewed, we use box plot for graphical representation and wilcoxon signed rank test to test if the median trip duration is different between member vs casual riders.

# Boxplot 
p <- ggplot(trips_data_4, aes(x=member_casual, y=duration_trip, fill=member_casual)) + 
  stat_boxplot(geom ='errorbar' , width = 0.5) + 
  geom_boxplot(width = 0.5, outlier.colour="black", outlier.shape = 20, outlier.size=1)+
  labs(title="Box Plot of Trip Duration between Member Vs Casual Riders" , x="Subscription status", y = "Trip duration (mins)")
p + theme_classic() + theme(legend.key.size = unit(1.5, 'cm'), 
                            legend.text = element_text(size = 10))  + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) + scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  stat_summary(fun=mean, geom = "point", shape = 2, size = 2, color = "black") +
  annotate(geom = "text", x=2.3, y=50, label = "Wilcoxon rank \n sum test \n p<0.001")

Median trip duration was significantly higher for casual riders as compared to member riders. Majority of the rides taken by members were between 5 to 15 mins whereas casual rides tend to take rides for 7 to 21 mins

Table 3: Comparison of the Type of Bike Between Members and Casual Riders

library(cardx)
trips_data_4 %>%  
  select(bike_type, member_casual) %>% #selecting vars for summary
  tbl_summary(
    by = member_casual, #grouping variable
    label = list(bike_type = "Type of bike", member_casual = "Subscription status"), #assigning variable labels 
    digits = list(all_categorical() ~ c(0, 1)), #display percentage with 1 decimal
  ) %>% 
  add_p() # Add chi-square/fishers exact p value
Characteristic Member
N = 2,637,277
1
Casual
N = 1,329,510
1
p-value2
Type of bike

<0.001
    Classic bike 1,786,184 (67.7%) 853,920 (64.2%)
    Electric bike 851,093 (32.3%) 475,590 (35.8%)
1 n (%)
2 Pearson’s Chi-squared test
# Proportion table for grouped bar chart

tab1 <- table(trips_data_4$member_casual, trips_data_4$bike_type) #frequency table
prop1 <- as.data.frame(prop.table(tab1, margin = 1)) #proportion with denominator as member_casual var
prop1$percent <- as.numeric(format(round(prop1$Freq*100,1),nsmall=1)) # percentage calculation
colnames(prop1) <- c("member_casual","bike_type","prop","percent") #renaming col names

# Grouped bar chart
p <- ggplot(data=prop1, aes(x=bike_type, y=percent, fill=member_casual)) + 
geom_bar(stat="identity", position=position_dodge()) +
  ylim (0,100) + 
  xlab("Type of the bike") + 
  ylab ("Percentage") +
  ggtitle("Usage by Bike Type between Member Vs Casual Riders") +
geom_text(aes(label = percent), size = 3, position = position_dodge(width = .9), vjust = -1, color = "black") 

p + theme_classic() +  scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  annotate(geom = "text", x=2, y=90, label = "Pearson’s Chi-squared test \n p<0.001")

32.3% of the members preferred to ride as compared to 35.8% of the casual riders. 67.7% of the members preferred to ride as compared to 64.2% of the casual riders. Overall, it was observed that the members preferred to ride classic bikes while the casual riders preferred to use the electric bikes.

Table 4: Compare the average ride duration and number of rides by each day of the week for members vs casual users

trips_data_4 %>%  
  select(duration_trip,  week_day_trip, member_casual) %>% #selecting vars for summary
  tbl_strata( strata = week_day_trip, # Stratifying variable
              .tbl_fun = 
                ~ .x %>% 
                tbl_summary(
    by = c(member_casual), #grouping variable
    label = list(duration_trip = "Trip duration (mins)", member_casual = "Subscription status"),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous() ~ c("{mean} ({sd})", "{median} ({p25} - {p75})","{min} - {max}"))
  )  %>% 
  add_p(), #add p value for wilcoxon signed rank test
  .header = "**{strata}**, N = {n}"
  )
Characteristic Sunday, N = 512155 Monday, N = 527302 Tuesday, N = 557795 Wednesday, N = 602352 Thursday, N = 579634 Friday, N = 572646 Saturday, N = 614903
Member
N = 288,591
Casual
N = 223,564
p-value1 Member
N = 376,567
Casual
N = 150,735
p-value1 Member
N = 411,476
Casual
N = 146,319
p-value1 Member
N = 435,797
Casual
N = 166,555
p-value1 Member
N = 415,366
Casual
N = 164,268
p-value1 Member
N = 374,790
Casual
N = 197,856
p-value1 Member
N = 334,690
Casual
N = 280,213
p-value1
Trip duration (mins)

<0.001

<0.001

<0.001

<0.001

<0.001

<0.001

<0.001
    Mean (SD) 13 (10) 18 (12)
11 (9) 15 (12)
11 (9) 14 (11)
11 (9) 15 (11)
11 (9) 14 (11)
11 (9) 16 (12)
13 (10) 18 (12)
    Median (Q1 - Q3) 10 (6 - 17) 14 (8 - 24)
8 (5 - 14) 12 (7 - 20)
9 (5 - 14) 11 (7 - 19)
9 (5 - 15) 11 (7 - 19)
9 (5 - 14) 11 (7 - 19)
9 (5 - 14) 12 (7 - 21)
10 (6 - 17) 15 (9 - 24)
    Min - Max 1 - 55 1 - 55
1 - 55 1 - 55
1 - 55 1 - 55
1 - 55 1 - 55
1 - 55 1 - 55
1 - 55 1 - 55
1 - 55 1 - 55
1 Wilcoxon rank sum test
#Calculating average duration for each week day between member and casual riders
t1 <- trips_data_4 %>% 
  group_by(member_casual, week_day_trip) %>%  #groups by subscription status and weekday
  summarise(number_of_rides = n()                           #calculates the number of rides and average duration 
            ,average_duration = mean(duration_trip),
            sd_duration = sd(duration_trip)) %>%        # calculates the average duration
  arrange(member_casual, week_day_trip) 


# Line chart
#Plot of Average Trip Duration by Day of the Week Between Member vs Casual Riders

p <- ggplot(t1, aes(x=week_day_trip, y=average_duration, fill=member_casual)) + 
   geom_col(position="dodge")+
  labs(title="Average Trip Duration by Day of the Week" , 
       x="Day of the Week", y = "Trip duration (mins)") + ylim(0,20)

p + theme_classic() + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) 

This visualization shows that average trip duration among members is highest during weekdays (Monday to Friday), while the average trip duration for casual riders tend to be more on weekends.

#Plot of Number of Rides by Day of the Week Between Member vs Casual Riders

p <- ggplot(t1, aes(x=week_day_trip, y=number_of_rides, fill=member_casual)) + 
  geom_col(position="dodge")+
  labs(title="Number of Rides by Day of the Week" , 
       x="Day of the Week", y = "Number of rides")
p + theme_classic() + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

This visualization shows that bike usage among members is highest during weekdays (Monday to Friday), while casual riders tend to use bikes more on weekends.

Next, we will visualize the average trip duration and the number of rides taken at the specific time of the day.

#Calculating number of rides between member and casual riders by time of the day
t2 <- trips_data_4 %>% 
  group_by(member_casual, hour = hour(time_trip)) %>%  #groups by subscription status and every hour
  mutate(number_of_rides = n(),
         average_duration = mean(duration_trip)) %>% 
  ungroup()

#Plot of Average Trip Duration by Time of the Day Between Member vs Casual Riders
p <- ggplot(t2, aes(x=time_trip, y=average_duration, group=member_casual)) + 
  geom_line(aes(color=member_casual))+ 
 # geom_point(aes(color=member_casual)) +
   #geom_errorbar(aes(y=), width=.2,
                # position=position_dodge(0.05)) +
  labs(title="Average Trip Duration by Time of the Day" , 
       x="Time of the Day", y = "Trip duration (mins)") +
  ylim(0,20)

breaks <- c(0, 6,12,18,24)*3600
labels <- format(as.POSIXct(breaks, tz = "UTC"), format = "%H:%M")

p + theme_classic() + scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49"))  +scale_x_time(breaks = breaks, labels = labels)

This visualization indicates that the average trip duration for members is almost constant through out the day whereas the casual riders tend to ride for longer from mid day to evening.

#Plot of Number of rides by Time of the Day Between Member vs Casual Riders
p<-ggplot(t2, aes(x=time_trip, y=number_of_rides, group=member_casual)) +
  geom_line(aes(color=member_casual)) + theme_classic() #+ geom_point(aes(color=member_casual))

breaks <- c(0, 6,12,18,24)*3600
labels <- format(as.POSIXct(breaks, tz = "UTC"), format = "%H:%M")

p + scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  labs(title="Number of Rides by Time of the Day" , 
       x="Time of the Day", y = "Number of rides") +scale_x_time(breaks = breaks, labels = labels)

This visualization indicates that members primarily use bikes during peak work hours, whereas casual riders tend to ride mainly in the evenings.

In the following analysis, we will visualize the most frequent start and end stations based on the number of rides, and explore how this distribution varies between member and casual riders.

Visualization for the most common start stations by subscription status

stn_name_final <- trips_data_4 %>% 
  group_by(start_station_id) %>%  #groups by start_station_id
  mutate(start_station_name_new = names(which.max(table(start_station_name)))) %>% 
  ungroup() %>% 
 group_by(end_station_id) %>%  #groups by start_station_id
  mutate(end_station_name_new = names(which.max(table(end_station_name)))) 


#Getting the one set of coordinate for each station id
station_info <- stn_name_final %>% 
   dplyr::group_by(start_station_id, start_station_name_new) %>%  #groups by start_station_id
  mutate (start_lat_final = mean(start_lat), start_lng_final = mean(start_lng)) %>% 
  dplyr::ungroup() %>% 
   dplyr::group_by(end_station_id, end_station_name_new) %>%  #groups by end_station_id
    mutate (end_lat_final = mean(end_lat), end_lng_final = mean(end_lng)) %>% 
  dplyr::ungroup()

# calculating the number of rides for each start station taken by member and casual riders
start_station_unique <- station_info %>% #filter(member_casual == "Casual") %>% 
  group_by(member_casual, start_station_id, start_station_name_new, start_lat_final, start_lng_final) %>%  
  summarise(number_of_rides = n())  %>% #calculates the number of rides 
  ungroup()

end_station_unique <- station_info %>%  #filter(member_casual == "Casual") %>% 
  group_by(member_casual, end_station_id, end_station_name_new, end_lat_final, end_lng_final) %>%  
  summarise(number_of_rides = n())  %>%     #calculates the number of rides 
  ungroup()

start_station_unique <- start_station_unique %>% 
  mutate(numBins = case_when(number_of_rides <= 2500 ~ 1,
                             number_of_rides > 2500 & number_of_rides <= 7500 ~ 2,
                             number_of_rides > 7500 & number_of_rides <= 10000 ~ 3,
                             number_of_rides > 10000 ~ 4)) %>% 
 mutate(numBins = factor(numBins, levels = c(1,2,3,4), labels = c("<2500", "2500 - 7500", "7500 - 10000", "> 10000")))              

end_station_unique <- end_station_unique %>% 
  mutate(numBins = case_when(number_of_rides <= 2500 ~ 1,
                             number_of_rides > 2500 & number_of_rides <= 7500 ~ 2,
                             number_of_rides > 7500 & number_of_rides <= 10000 ~ 3,
                             number_of_rides > 10000 ~ 4)) %>% 
 mutate(numBins = factor(numBins, levels = c(1,2,3,4), labels = c("<2500", "2500 - 7500", "7500 - 10000", "> 10000")))   

  
#install.packages("plotly")
library(plotly)
library(sf)
shape <- read_sf(dsn = "C:/Users/Smrut/OneDrive/Documents/Cyclistic Case Study/Data", 
                 layer = "geo_export_36491657-9a1b-46a0-a113-872487cb2834")
p <- ggplot(shape) +
  geom_sf(fill = "grey", color = "white")

# Visualization for most common start points for rides taken by member vs casual riders

q <- p + geom_point(data=start_station_unique ,aes(x=start_lng_final,y=start_lat_final, color=numBins, 
                                             text = paste("Start station:", start_station_name_new, "\n", "No. of rides:", number_of_rides)),size=1,alpha=0.75) + xlab("Longitude") + ylab("Latitude") + ggtitle("Distribution of Starting Stations with Number of Rides in Chicago") + 
  facet_grid(. ~ member_casual, axes = "all") + 
    scale_colour_brewer(palette = "RdYlGn", direction = -1, name = "Number of Rides")

ggplotly(q, source = "select", tooltip = c("text"), width= 1000, height= 500)

Plot x: Visualization for the most common end stations by subscription status

# Visualization for most common start points for rides taken by member vs casual riders

q <- p + geom_point(data=end_station_unique ,aes(x=end_lng_final,y=end_lat_final, color=numBins, 
                                                   text = paste("End station:", end_station_name_new, "\n", "No. of rides:", number_of_rides)),size=1,alpha=0.75) + 
  xlab("Longitude") + ylab("Latitude") + ggtitle("Distribution of Ending Stations with Number of Rides in Chicago") + 
  facet_grid(. ~ member_casual, axes = "all") + 
  scale_colour_brewer(palette = "RdYlGn", direction = -1, name = "Number of Rides")

ggplotly(q, source = "select", tooltip = c("text"), width= 1000, height= 500)

SHARE

Conclusion

  • Over the last 12 months, close to 4 million rides were taken, with casual riders representing 33.5% and members making up 66.5% of all rides

  • Median trip duration in casual riders was significantly higher as compared to members

  • Majority of the rides taken by members were between 5 to 15 mins whereas casual rides tend to take rides for 7 to 21 mins

  • It was observed that the members preferred to ride classic bikes while the casual riders preferred to use the electric bikes

  • Average trip duration on weekends was reportedly higher as compared to the weekdays for both member and casual riders

  • Members predominantly used bikes on weekdays and during peak work hours, while casual riders primarily preferred to ride on weekends and in the evenings

  • The most popular start and end stations for members, each with over 10,000 rides, were situated in central Chicago as well as the north and south sides. In contrast, casual riders primarily began and finished their rides from stations in central Chicago

ACT

Recommendations

  • The insights from this analysis could be used to use to develop targeted marketing initiatives at the most frequently used start stations by the casual riders

  • Further analysis that incorporates cost-effectiveness could provide insights into the additional spending of casual riders compared to members

  • Given that casual riders favor electric bikes over classic bikes, offering reduced membership fees for electric bike usage could be an effective incentive